- Objective
- Target Variable
- Missing Data
- Unique Values
- Balance
- Spread
- Initial Exploration
- Encoding Categorical Variables
- Multicollinearity
- Feature Engineering
- Data Split and Scaling
- Hyperparameter Optimization and Performance Metrics
- Model 1: Logistic Regression
- Training
- Testing
- Model 2: Random Forest
- Training
- Testing
- Model 3: XGBoost
- Training
- Testing
- Interpreting Lime Results
- Insights Drawn
To the world of finance and banking, the future is everything. More specifically, analyzing the past to understand the future is everything. One of the biggest problems companies face is customer attrition. Acquiring a credit card customer costs around $200 per user in the United States on average (source), and in some cases it can go much higher. Building a loyal customer base secures revenue, and being able to identify customers who are about to attrite lets a company strengthen its relationship with those customers and incentivize them to stay.
In this tutorial, I will aim to take the reader through the data science pipeline with a customer churn use case. We will explore the factors that result in credit card customer attrition through data analysis as well as machine learning, with a later inclusion of explainable ML.
Given a dataset with information from customers' bank and transaction records, we want to be able to build an ML model that does reasonably well at predicting if a customer is likely to attrite, and then we want to be able to find out why.
The dataset is a set of credit card customer records for an unnamed bank (as most publicly available financial datasets are). The original source of this data is a website called LEAPS, which has a walkthrough of using Naive Bayes classification to solve this problem. However, we will be removing the Naive Bayes columns from the dataset, because the focus of this data science tutorial is on exploratory data analysis and machine learning.
The dataset can be found and downloaded on Kaggle, as well.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import random
I'm uploading the dataset as a .csv file stored on my local system into this notebook.
from google.colab import files
my_file = files.upload()
Let's take a look at the dataset by storing it in a Pandas DataFrame, df.
df = pd.read_csv("BankChurners.csv")
df.head()
Right off the bat, we notice three columns (CLIENTNUM and the two columns with Naive Bayes results) that will serve no purpose in this tutorial. Keeping them around would simply induce a headache until the Feature Engineering step in 5. Machine Learning, so we can remove these three columns straight away.
df.drop(columns=["CLIENTNUM", "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1","Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2"], inplace=True)
df.head()
Let's see what Pandas can reveal about the size of the dataset and the data types in each column. We can then cross-reference that with the descriptions from the original source to identify which variables are numerical and which are categorical, and list out what each one means.
df.shape
df.dtypes
Now, below is a breakdown of the numerical and the categorical variables along with corresponding descriptions as listed on the original source, LEAPS.

Numerical Variables:
- Customer_Age: demographic variable - customer's age in years
- Dependent_count: demographic variable - number of dependents
- Months_on_book: months on book (time of relationship)
- Total_Relationship_Count: total no. of products held by the customer
- Months_Inactive_12_mon: no. of months inactive in the last 12 months
- Contacts_Count_12_mon: no. of contacts in the last 12 months
- Credit_Limit: credit limit on the credit card
- Total_Revolving_Bal: total revolving balance on the credit card
- Avg_Open_To_Buy: open to buy credit line (average of last 12 months)
- Total_Amt_Chng_Q4_Q1: change in transaction amount (Q4 over Q1)
- Total_Trans_Amt: total transaction amount (last 12 months)
- Total_Trans_Ct: total transaction count (last 12 months)
- Total_Ct_Chng_Q4_Q1: change in transaction count (Q4 over Q1)
- Avg_Utilization_Ratio: average card utilization ratio

Categorical Variables:
- Attrition_Flag: internal event (customer activity) variable - 1 if the account is closed, else 0
- Gender: demographic variable - M, F
- Education_Level: demographic variable - educational qualification of the account holder
- Marital_Status: demographic variable - Married, Single, Divorced, Unknown
- Income_Category: demographic variable - annual income category of the account holder
- Card_Category: product variable - type of card (Blue, Silver, Gold, Platinum)

Our target variable here, the one that we will focus our analysis around, is Attrition_Flag, according to the goal set forth in 1. Introduction.
There's another problem that's immediately visible when we examine the output of df.head(), which is the presence of rows with "Unknown" values. This dataset does have some missing data. Since we do not have too much information about how the data was collected, attempting to fill in these values with mean values, mode values, or some other form of extrapolation might not present us with an accurate enough picture.
It is most likely best to remove rows with "Unknown" values from the dataset since the data we have will eventually be used in ML further below.
# Drop every row that contains an "Unknown" in any column
df = df[~df.isin(["Unknown"]).any(axis=1)]
df.shape
Fortunately, this only cost us approximately 30% of the dataset, which still leaves plenty of complete data to analyze. Had the loss been more severe, it would have been worth digging deeper into another way to combat the missing data.
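As a quick sketch of this bookkeeping, here's a hypothetical miniature frame applying the same filtering idea and measuring the fraction of rows lost (the column names and values below are illustrative, not taken from our dataset):

```python
import pandas as pd

# Hypothetical miniature frame with "Unknown" entries scattered in two columns.
toy = pd.DataFrame({
    "Education_Level": ["Graduate", "Unknown", "College", "Graduate"],
    "Marital_Status": ["Married", "Single", "Unknown", "Single"],
})

before = len(toy)
# Keep only rows with no "Unknown" in any column.
clean = toy[~toy.isin(["Unknown"]).any(axis=1)]
fraction_lost = 1 - len(clean) / before
print(fraction_lost)  # 0.5 here: 2 of the 4 rows contained an "Unknown"
```

The same ratio, computed before and after the real filtering step, is how we arrive at the roughly 30% figure above.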
With the missing data gone, let's now see how many unique values each feature holds.
df.nunique().to_frame("Unique_Values").sort_values("Unique_Values")
It seems like all of the categorical variables we identified in organizing our initial list have low counts of unique values, which is great since it would be hard to make sense of categorical variables that could take on a wide range of values - they may as well be continuous beyond a point. We can also rest assured that our focus, Attrition_Flag, only has two unique values. Phew!
Now that we've been able to organize our understanding of the features we have in the dataset, let's look at some key relationships between the variables that can help inform our judgement as to what affects credit card customer attrition.
A good starting point would be to check the balance of the dataset and see what proportion of our data represents customers that actually attrited.
df.Attrition_Flag.value_counts()
This is a very imbalanced dataset, with only about 16% who attrited!
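A normalized value_counts makes that proportion explicit. A toy sketch with hypothetical labels mirroring our roughly 84/16 split:

```python
import pandas as pd

# Hypothetical labels standing in for Attrition_Flag's two classes.
flags = pd.Series(["Existing Customer"] * 84 + ["Attrited Customer"] * 16)

# normalize=True turns raw counts into proportions that sum to 1.
proportions = flags.value_counts(normalize=True)
print(proportions)  # Existing Customer 0.84, Attrited Customer 0.16
```

Running `df.Attrition_Flag.value_counts(normalize=True)` on the real frame gives the same kind of readout, which is where the ~16% figure comes from.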
Let's get some preliminary information about each of the numerical columns in our dataset, and see what we can learn from that.
df.describe().apply(lambda s: s.apply('{0:.3f}'.format))
Let's make note of some major observations and potential explanations:
- Customer_Age is centered around 46, not dropping below 26 and not exceeding 73. This could indicate that this bank doesn't focus on the lower-age demographic through means like e-banking or a smartphone app (even though those may be options), and there don't seem to be any accounts listed for minors.
- Dependent_count shows us most customers have at least two dependents, spanning all the way up to five, which seems fitting given the spread of the customers' ages.
- Months_on_book seems to range from accounts that have been open for just over a year to accounts that have been open for about four years. A little unexpected, given that the age range is on the higher side, but perhaps in the data we've been given the bank wanted to focus on the younger accounts, which have a naturally higher attrition propensity.
- Months_Inactive_12_mon certainly seems like an important feature to retain as a telltale sign of a customer about to churn.
- Contacts_Count_12_mon ranges from no contacts to six contacts in the last year. On the surface, more contacts may seem like a good thing, since it signals a lack of customer inactivity. However, that's not something to conclude without knowing whether the contacts were about, for example, a malfunctioning product.
- Customers with a higher Total_Revolving_Bal and a higher Avg_Utilization_Ratio would probably be more likely to stay. Let's plot those two distributions against attrition to check.

plt.figure(figsize=(6, 8))
sns.violinplot(data=df, x="Attrition_Flag", y="Total_Revolving_Bal", palette="mako")
plt.title("Distributions of Total_Revolving_Bal Across Attrition")
plt.figure(figsize=(6, 8))
sns.violinplot(data=df, x="Attrition_Flag", y="Avg_Utilization_Ratio", palette="mako")
plt.title("Distributions of Average Card Utilization Ratio Across Attrition")
It seems that these factors do make a noticeable difference, as expected - when these values are higher, the customer seems less likely to attrite. While we're at it, let's take a look at the difference between the number of products customers have held when they've stayed versus attrited.
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x="Total_Relationship_Count", hue="Attrition_Flag", palette="mako")
plt.title("Counts of Attrition Across Numbers of Products Held")
This is also a useful piece of information to us! Looking at the relative distributions, customers with more products with the bank are more likely to stay, and those with fewer products are more likely to attrite.
Intuitively, there should be a link between Customer_Age and Months_on_book, since older customers would tend to have longer open accounts. Let's examine a scatter plot between those two variables with a regression line.
plt.figure(figsize=(10,5))
sns.regplot(data=df, x="Customer_Age", y="Months_on_book")
plt.title("Trend Between Age and Months On Book")
That is a very strong correlation! It does make logical sense for this trend to exist, as previously noted, but this could pose problems when we search for collinearity in our data later on.
Another intriguing aspect of the shape of the scatter plot here is the values of Months_on_book centering heavily around a singular value between 30 and 40. Let's examine that distribution.
plt.figure(figsize=(10,5))
sns.countplot(data=df, x="Months_on_book", hue="Attrition_Flag", palette="mako")
plt.title("Count Distribution of Months On Book")
That is such a heavy center of the distribution! While the overall spread looks symmetrical, it's strange that so many of our datapoints lie squarely at 36 months of having an open account with the bank. We don't have enough temporal context about the collection of this data to guess why, but this variable probably would not be doing our ML model too many favors.
Does Customer_Age's effect on Attrition_Flag have any surprises in store for us?
plt.figure(figsize=(10,5))
sns.violinplot(data=df, x="Attrition_Flag", y="Customer_Age", palette="mako")
plt.title("Distributions of Age Across Attrition")
There's barely any difference between the age distributions of customers who attrited and customers who stayed with the bank! At least, that's what the data we have tells us. It doesn't seem like there's a major discernible pattern here, save for a few outliers in the existing customers' distribution.
Another factor that we'd think would affect Attrition_Flag is Income_Category - wouldn't higher income people be less likely to attrite? Examining that distribution could tell us if our little informal hypothesis is accurate.
plt.figure(figsize=(10,5))
sns.countplot(x="Income_Category", data=df, hue="Attrition_Flag", palette="mako")
Just by eyeballing it, the income category distributions between the existing and the attrited customers don't look all that different. Maybe it doesn't affect Attrition_Flag as much as we thought. A more rigorous analysis further below can confirm that.
We've been able to identify some high-level information, examine a couple preliminary distributions, and make some broad observations. However, this is only giving us part of the picture. We need to look at all the features we have at hand and see what we can analyze in terms of both interdependence of the feature set as well as the target variable, Attrition_Flag.
In order for us to be able to assess relationships between our variables further, it would be best to convert the categorical values we have into a numerical form as well for better comparison. Not to mention, ML models can't digest categorical values unless they're converted into numbers!
Unfortunately, an ordinal encoding may (by definition) introduce some level of unintended ordering in those variables, but the other option is to convert those values into dummy columns (one-hot-encoded), which would be harder to search for correlations in. We'll go ahead and encode them ordinally.
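To make the trade-off concrete, here's a toy sketch (hypothetical values, mirroring our Card_Category column) contrasting an ordinal mapping against pandas' one-hot `get_dummies`:

```python
import pandas as pd

toy = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Blue", "Gold"]})

# Ordinal: a single column, but it implies Blue < Silver < Gold.
ordinal = toy["Card_Category"].map(
    {"Blue": 1, "Silver": 2, "Gold": 3, "Platinum": 4})

# One-hot: no implied order, but one new column per category,
# which makes a correlation heatmap much noisier to read.
one_hot = pd.get_dummies(toy["Card_Category"], prefix="Card")

print(ordinal.tolist())              # [1, 2, 1, 3]
print(one_hot.columns.tolist())      # one Card_* column per category
```

The single-column form is what lets us drop these features straight into the VIF computation and correlation heatmap below.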
Note: Attrition_Flag will be encoded such that 0 represents a customer who stayed and 1 represents a customer who attrited, since the problem is focused towards the event of attrition.
categories_encoded = {
"Attrition_Flag": {"Existing Customer": 0, "Attrited Customer": 1},
"Gender": {"M": 1, "F": 2},
"Education_Level": {"Uneducated": 1, "High School": 2, "College": 3, "Graduate": 4, "Post-Graduate": 5, "Doctorate": 6},
"Marital_Status": {"Single": 1, "Divorced": 2, "Married": 3},
"Income_Category": {"Less than $40K": 1, "$40K - $60K": 2, "$60K - $80K": 3, "$80K - $120K": 4, "$120K +": 5},
"Card_Category": {"Blue": 1, "Silver": 2, "Gold": 3, "Platinum": 4}
}
df.replace(categories_encoded, inplace=True)
df.head()
Fortunately, most of those variables work fine with an ordered encoding! That means that there is a broad level of natural ordering that can be provided to them that makes sense, such as increasing Education_Level and increasing Income_Category.
The only one that really doesn't have an order to it is Gender, but we simply have to pick an arbitrary order to continue with analysis. Let's just confirm that we're left with only numerical values in each column before moving onto more advanced EDA.
df.dtypes
Multicollinearity is the presence of independent variables that are interrelated to a high degree, making it difficult to distinguish their individual effects on the target variable.
In our case, we want to identify which of our features contribute to multicollinearity so that we can make it easier for our ML model coming in later to distinguish between these factors in predicting Attrition_Flag.
From the initial analysis, it does seem like the exact relation between attrition and most major features is hard to pinpoint. One commonly used way to measure which variables are contributing most to multicollinearity is VIF (Variance Inflation Factor).
VIF for feature i is given by the formula:

VIF_i = 1 / (1 - R_i^2)

where R_i^2 is the coefficient of determination from a linear regression of feature i on all the other features. Since it intrinsically uses the same idea as linear regression, but applied between each variable and all the other variables, VIF is a good way to detect collinearity amongst variables. A VIF value much larger than 10 is best avoided.
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_independent_vars = df.iloc[:, 1:] # every column except the target, Attrition_Flag
vif_data = pd.DataFrame()
vif_data["Feature"] = vif_independent_vars.columns
vif_data["VIF"] = [variance_inflation_factor(vif_independent_vars.values, i) for i in range(len(vif_independent_vars.columns))]
vif_data
There are some variables (Credit_Limit, Total_Revolving_Bal, and Avg_Open_To_Buy) with inf VIF values, meaning they are perfectly correlated with some other variables, reducing the so-called "independence" in our independent variable set (or feature set). We also see that Customer_Age and Months_on_book also have very high VIF values.
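To see why perfect correlation drives VIF to infinity, here's a toy sketch with an exact linear identity, loosely mirroring how an open-to-buy amount would relate to a credit limit and a revolving balance (that identity is an assumption for illustration, not something documented in our dataset):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy columns where the third is exactly determined by the other two.
rng = np.random.default_rng(1)
limit = rng.uniform(1000, 30000, size=100)
revolving = rng.uniform(0, 2500, size=100)
open_to_buy = limit - revolving  # exact linear identity

X = np.column_stack([limit, revolving, open_to_buy])
vifs = [variance_inflation_factor(X, i) for i in range(3)]
print(vifs)  # all three explode: each column is a perfect combination of the others
```

An R^2 of exactly 1 makes the denominator of 1 / (1 - R^2) vanish, which is where those inf values come from.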
We could go into a feature-by-feature analysis of p-values or ANOVA to truly pick apart the underlying influences, but that would get rather tedious given how many features we have, especially when part of this project's goal is to let a machine learning model decide what's important!
That isn't to say that we should ignore those interrelated features, however. There's another big-picture way we can use to quickly understand what features are correlated with what others. That big-picture way (literally, the figure is quite large) is Seaborn's heatmap, which shows us one-on-one interactions between variables.
plt.figure(figsize=(17, 15))
sns.heatmap(data=df.corr(), annot=True)
plt.ylabel("Feature 1")
plt.xlabel("Feature 2")
plt.title("Heatmap of Features")
As we can see, the same variables from the VIF analysis that we noticed were very correlated with other variables are indeed reflected in the heatmap, but now on a one-on-one interaction each. Below is a list of the major "red flags" for us (so to speak) and the other variables they show correlations with:
- Customer_Age: Months_on_book
- Credit_Limit: Gender, Income_Category, Card_Category, Avg_Open_To_Buy, and Avg_Utilization_Ratio
- Total_Revolving_Bal: Avg_Utilization_Ratio
- Total_Trans_Ct: Total_Trans_Amt

Logically, there do seem to be links between these variables, as previously alluded to. For example, like we saw in the initial exploration we did, the older a customer is, the more likely they'll have an open account with the same bank for longer. Another example is that the total transaction amount goes up with the total transaction count.
However, having multiple such variables could pose a problem in training our machine learning model. We'll try to either get rid of them or convert them into more useful values in the Feature Engineering part of 5. Machine Learning.
With all the analysis we've done so far, we've been able to derive some degree of statistics-based human judgement as to how the variables we have interact with each other.
The goal for us now is to create a machine learning model to learn to take in several input features about a customer and predict whether that customer is likely to attrite or not.
Before that, however, let's act on the insights we've gained from 4. Exploratory Data Analysis. We've made some observations in our initial basic exploration, and we've noticed some features contribute to multicollinearity. Let's use that understanding to transform our feature set before we feed it into a model we think will work well here.
Here's a little checklist of things we can modify about our data:
- Drop Months_on_book. Reason: peculiar distribution of data as well as heavy contribution to multicollinearity.
- Drop Avg_Utilization_Ratio. Reason: heavy contribution to multicollinearity with several variables, and not much correlation with Attrition_Flag.
- Drop Avg_Open_To_Buy. Reason: heavy contribution to multicollinearity with several variables, and not much correlation with Attrition_Flag.
- Drop Gender. Reason: some contribution to multicollinearity without much impact on Attrition_Flag.
- Drop Credit_Limit. Reason: heavy contribution to multicollinearity without much impact on Attrition_Flag.
- Drop Customer_Age. Reason: some contribution to multicollinearity without much impact on Attrition_Flag.
- Divide Total_Trans_Amt by Total_Trans_Ct to get Amt_Per_Trans, and then drop Total_Trans_Amt and Total_Trans_Ct. Reason: an average amount per transaction is a better indicator of the customer's interest in continuing to use their credit card than two aggregate values which each contribute to multicollinearity.
df["Amt_Per_Trans"] = df["Total_Trans_Amt"]/df["Total_Trans_Ct"]
df.drop(columns=["Total_Trans_Ct", "Total_Trans_Amt", "Months_on_book", "Avg_Utilization_Ratio", "Customer_Age", "Avg_Open_To_Buy", "Gender", "Credit_Limit"], inplace=True)
df.head()
df.shape
As a certain legendary CMSC320 professor once said to his Spring 2021 class, it's important to allow the next stage of the data science pipeline to inform the previous stage, so that we optimize our approach over time. In light of that, let's check if we've combatted the issue of multicollinearity in our variables.
vif_independent_vars = df.iloc[:, 1:13]
vif_data = pd.DataFrame()
vif_data["Feature"] = vif_independent_vars.columns
vif_data["VIF"] = [variance_inflation_factor(vif_independent_vars.values, i) for i in range(len(vif_independent_vars.columns))]
vif_data
The VIF values are looking much better, and much closer to our desired range! It's time to finally move onto the next step.
In ML, there's a four-way division of the data, produced by two splits: the features-label split and the training-testing split.
The features-label split allows us to explicitly distinguish the target variable, or y (in our case, the label, which is Attrition_Flag), from the remaining independent variables, or x (everything we've been left with after our feature engineering step).
Let's shuffle the dataset for good measure before proceeding with splitting.
df = df.sample(frac=1).reset_index(drop=True) # shuffling dataset
Below, df_y will store the label as a Pandas DataFrame, whereas y will store it as a NumPy Ndarray.
df_y = df[["Attrition_Flag"]]
y = df_y.to_numpy().ravel() # flatten to 1-D so scikit-learn doesn't warn about column vectors
y
Now, because the values we have in each column of our feature set vary so widely across their minimum and maximum values, it's a good idea to apply some sort of scaling (or normalization, in a sense) to the data so that the machine learning model can converge more efficiently. We'll apply scikit-learn's StandardScaler, which centers each column at zero and rescales it by its standard deviation.
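As a quick sanity check of what StandardScaler actually computes, here's a toy column scaled both by the library and by hand with (x - mean) / std:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single hypothetical feature column.
col = np.array([[1.0], [2.0], [3.0], [4.0]])

scaled = StandardScaler().fit_transform(col)

# By hand: subtract the column mean, divide by the (population) std,
# which is exactly StandardScaler's default behavior.
by_hand = (col - col.mean()) / col.std()

print(scaled.ravel())  # symmetric around 0, unit variance
```

After this transform every column contributes on a comparable scale, which is what helps gradient-based models like logistic regression converge.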
As we did with the label, we'll store the features as a Pandas DataFrame in df_x, and as a NumPy Ndarray (after scaling) in x.
df_x = df.iloc[:, 1:13]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x = scaler.fit_transform(df_x)
x
And now, the second two-way split of the data: the training-testing split. This is important because we want to distinguish the seen data from the unseen data. This means that the machine learning model will attempt to "converge", or reach an "understanding" (optimal configuration of parameters), over the seen data, or the training set.
In essence, it uses the training set to try to learn what makes a customer attrite.
For us to gauge how well it has trained itself, we can assess its performance on the unseen data, or testing set, and make use of some common performance metrics to evaluate it.
We'll go with the common training:testing set ratio of 80:20.
We can go ahead and split x into x_train and x_test and y into y_train and y_test.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
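Since only about 16% of our rows are attriters, a plain random split can leave the test set with a slightly different class balance than the training set. An optional tweak we didn't use above is the `stratify` argument, which preserves the ratio in both halves. A small sketch on hypothetical labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 84 zeros, 16 ones, mirroring our ~16% attrition.
y_demo = np.array([0] * 84 + [1] * 16)
x_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps roughly the same class ratio in train and test.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

print(y_tr.mean(), y_te.mean())  # both close to 0.16
```

On a dataset this size the unstratified split above usually lands close to the true ratio anyway, but stratification removes the luck from it.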
Now that we've prepared our data to be trained and tested on, let's get our first ML model on the job!
In the machine learning models that are about to follow, you will notice that they are not running entirely on stock/default settings. At the time of creating this notebook, I conducted some basic hyperparameter tweaking to produce reasonably good performances from each model.
The performance metrics we'll rely on in assessing training and testing performance will be accuracy, precision, recall, and AUC-ROC. Out of all of these metrics, our primary focus will be on maximizing the following two to a reasonable extent:
- Recall, since missing a customer who is about to attrite (a false negative) is the costliest mistake in this use case
- AUC-ROC, since it reflects how well the model separates the two classes, which matters on an imbalanced dataset
Here is a quick visual summary of performance metrics:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import roc_auc_score
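To keep the definitions concrete before we start scoring models, here's a tiny worked example on hypothetical labels, checking precision and recall against the confusion-matrix counts by hand:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Tiny hypothetical predictions to pin down the definitions.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])

# sklearn's confusion_matrix flattens as (tn, fp, fn, tp) for binary labels.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# precision = TP / (TP + FP): of everyone we flagged, how many truly attrited
# recall    = TP / (TP + FN): of everyone who attrited, how many we caught
print(precision_score(y_true, y_pred), tp / (tp + fp))  # 2/3 both ways
print(recall_score(y_true, y_pred), tp / (tp + fn))     # 2/3 both ways
print(accuracy_score(y_true, y_pred))                   # 0.75
```

Keeping these formulas in mind makes the confusion-matrix heatmaps below much easier to read at a glance.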
This seems like a good place to start with this problem. After all, a logistic regression model weighs multiple inputs and compresses them into a prediction between 0 and 1. Let's see how it does.
from sklearn.linear_model import LogisticRegression
model_1 = LogisticRegression(C=500, max_iter=2000)
model_1.fit(x_train, y_train)
train_predictions = model_1.predict(x_train)
cm_train = confusion_matrix(y_train, train_predictions)
sns.heatmap(cm_train, annot=True, fmt="d", cmap='Blues')
plt.ylabel("Ground Truth")
plt.xlabel("Predicted")
plt.title("Training Confusion Matrix")
print("Training Accuracy: ", accuracy_score(y_train, train_predictions))
print("Training Precision:", precision_score(y_train, train_predictions))
print("Training Recall: ", recall_score(y_train, train_predictions))
print("Training AUC-ROC: ", roc_auc_score(y_train, train_predictions))
test_predictions = model_1.predict(x_test)
cm_test = confusion_matrix(y_test, test_predictions)
sns.heatmap(cm_test, annot=True, fmt="d", cmap='Blues')
plt.ylabel("Ground Truth")
plt.xlabel("Predicted")
plt.title("Testing Confusion Matrix")
print("Testing Accuracy: ", accuracy_score(y_test, test_predictions))
print("Testing Precision: ", precision_score(y_test, test_predictions))
print("Testing Recall: ", recall_score(y_test, test_predictions))
print("Testing AUC-ROC: ", roc_auc_score(y_test, test_predictions))
Our logistic regression model did not do so well on this task. We have lots of false negatives, as our low recall score and our confusion matrix show us.
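One knob worth trying for a problem like this (though not something we tuned above) is `class_weight="balanced"`, which upweights the minority class in the loss so the model stops defaulting to "everyone stays". A sketch on synthetic imbalanced data, purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (about 10% positives), standing in for churn data.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000,
                              class_weight="balanced").fit(X_tr, y_tr)

r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
# Weighting typically raises recall, usually at some cost to precision.
print(r_plain, r_weighted)
```

Whether that recall-for-precision trade is worth it depends on how costly a missed churner is relative to a wasted retention offer.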
Let's see if our luck goes up with a random forest classifier! After all, it's a meta estimator (averaging the results of individual decision trees). Hopefully, it'll perform better than the singular logistic regression model we used.
from sklearn.ensemble import RandomForestClassifier
model_2 = RandomForestClassifier(max_depth=40)
model_2.fit(x_train, y_train)
train_predictions_2 = model_2.predict(x_train)
cm_train_2 = confusion_matrix(y_train, train_predictions_2)
sns.heatmap(cm_train_2, annot=True, fmt="d", cmap='Blues')
plt.ylabel("Ground Truth")
plt.xlabel("Predicted")
plt.title("Training Confusion Matrix")
print("Training Accuracy: ", accuracy_score(y_train, train_predictions_2))
print("Training Precision:", precision_score(y_train, train_predictions_2))
print("Training Recall: ", recall_score(y_train, train_predictions_2))
print("Training AUC-ROC: ", roc_auc_score(y_train, train_predictions_2))
Those values seem just a little suspicious, but it's not uncommon for random forests to perform like this on training data. It does stink of overfitting, though.
test_predictions_2 = model_2.predict(x_test)
cm_test_2 = confusion_matrix(y_test, test_predictions_2)
sns.heatmap(cm_test_2, annot=True, fmt="d", cmap='Blues')
plt.ylabel("Ground Truth")
plt.xlabel("Predicted")
plt.title("Testing Confusion Matrix")
print("Testing Accuracy: ", accuracy_score(y_test, test_predictions_2))
print("Testing Precision: ", precision_score(y_test, test_predictions_2))
print("Testing Recall: ", recall_score(y_test, test_predictions_2))
print("Testing AUC-ROC: ", roc_auc_score(y_test, test_predictions_2))
There we are, our recall finally crossed 0.5! This is a decent improvement on the logistic regression model. Our AUC-ROC score went up as well, so the random forest classifier seems to be better at distinguishing between labels.
Naturally, we can't let a machine learning problem go by without giving extreme gradient boosting a shot at it. Hopefully, we can beat the random forest classifier's scores, and if we're successful, we can move onto explaining attrition using this model.
from xgboost import XGBClassifier
model_3 = XGBClassifier(use_label_encoder=False, learning_rate=1, n_estimators=70)
model_3.fit(x_train, y_train)
train_predictions_3 = model_3.predict(x_train)
cm_train_3 = confusion_matrix(y_train, train_predictions_3)
sns.heatmap(cm_train_3, annot=True, fmt="d", cmap='Blues')
plt.ylabel("Ground Truth")
plt.xlabel("Predicted")
plt.title("Training Confusion Matrix")
print("Training Accuracy: ", accuracy_score(y_train, train_predictions_3))
print("Training Precision:", precision_score(y_train, train_predictions_3))
print("Training Recall: ", recall_score(y_train, train_predictions_3))
print("Training AUC-ROC: ", roc_auc_score(y_train, train_predictions_3))
test_predictions_3 = model_3.predict(x_test)
cm_test_3 = confusion_matrix(y_test, test_predictions_3)
sns.heatmap(cm_test_3, annot=True, fmt="d", cmap='Blues')
plt.ylabel("Ground Truth")
plt.xlabel("Predicted")
plt.title("Testing Confusion Matrix")
print("Testing Accuracy: ", accuracy_score(y_test, test_predictions_3))
print("Testing Precision: ", precision_score(y_test, test_predictions_3))
print("Testing Recall: ", recall_score(y_test, test_predictions_3))
print("Testing AUC-ROC: ", roc_auc_score(y_test, test_predictions_3))
It looks like we definitely did beat the random forest classifier! Our recall is around 0.6, which is not as high as it could get with more tuning or a different model, but this is sufficient to move onto explaining our champion model's predictions.
We've run three different ML models on the data in increasing order of test performance. Given time, we could play with and tweak these models forever. We could even delve into deep learning models, which would train for longer and probably perform better on test data. However, the goal of this project is not really on optimal performance - rather, it is on being able to explain the link between factors that resulted in attrition.
To preserve that purpose, we'll use a tool called LIME (Local Interpretable Model-agnostic Explanations) on our best-performing ML model, our XGBoost classifier.
Let's first install the package into the environment.
!pip install lime
We can instantiate an explainer object that can take in individual datapoints and our chosen model's prediction on that datapoint, and then show us what factors the model deemed most relevant in that case of attrition.
import lime
import lime.lime_tabular
list_of_features = df_x.columns
explainer = lime.lime_tabular.LimeTabularExplainer(
x_train,
mode="classification",
feature_names=list_of_features,
)
Let's store a list of all the indices of our test set that had attrited customers (label 1) that our model was able to successfully identify, so that LIME can analyze why the model thought that customer left.
In other words, we want to find a few true positives out of our ML model's predictions to understand why each of them attrited.
LIME works best as of now at being able to explain individual data points, which is why we'll be running it only on a small set of true positives out of the whole list we're collecting below.
# Collect true positives using the test-set predictions we already computed
true_positive_indices = [
    i for i in range(len(y_test))
    if y_test[i] == 1 and test_predictions_3[i] == 1
]
For demonstration, let's run our LIME explainer on the first three correctly identified cases of attrition on our list of true_positive_indices.
for idx in true_positive_indices[:3]:
    print("Prediction:", model_3.predict(x_test[idx].reshape(1,-1)))
    print("Actual:    ", y_test[idx])
    explanation = explainer.explain_instance(
        x_test[idx],
        model_3.predict_proba,
        num_features=len(list_of_features))
    explanation.show_in_notebook()
A note on how to see what LIME is telling us:
Note: due to the way that ML models train, and due to the shuffling of the dataset, these results will slightly vary every time this notebook is run. I did not implement a random seed to preserve the real-time operation of this code, should someone want to download this notebook and run/tweak it themselves.
From all the analysis we've done, including studying the correlations to Attrition_Flag on the heatmap and interpreting what LIME has revealed to us about our XGBoost model's understanding of influential features, we can identify that among the most influential are the following:
- Total_Revolving_Bal - the lower it is, the greater the risk of attrition
- Total_Amt_Chng_Q4_Q1 - the lower it is, the greater the risk of attrition
- Months_Inactive_12_mon - the higher it is, the greater the risk of attrition
- Contacts_Count_12_mon - the higher it is, the more dissatisfied a customer seems to be
- Amt_Per_Trans - the higher it is, the smaller the chance of attrition

All of these are factors that make sense, and being able to act on them by incentivizing credit card customers accordingly and maintaining a good relationship with them can help the bank minimize its attrition.
And not only have we identified key factors that result in attrition, but we also have an explainable way to predict if someone is likely to attrite, given the data we need about them. This enhances the real-world usability of a pipeline like this.
This isn't the shortest data science tutorial, as far as they go. However, the problem we tackled reflects an actual business need: not only having a solution that can guess which customers are likely to leave, but also knowing, through our explainability framework, what factors influenced each customer's decision.
We've gone over a wide range of concepts and tools, spanning from organizing our variables to taking a look at their distributions to identifying interdependence within the "independent" set of feature variables to pinpointing causes of multicollinearity to cleaning those variables up in feature engineering to running three machine learning models on them and finally providing our champion model with the ability to explain its predictions.
As we can imagine, the usability of such a solution (albeit hopefully with a slightly better-performing model) in the real world is rooted not only in predictive capability, but in trust. Not just the business user's trust in what was a black-box algorithm, but the intrinsic idea that a human can trust a self-teaching intelligent solution to explain itself and its decisions.
With a more detailed field-specific understanding (knowing the applied meaning of the features we had), a more extensive search for the right model, and a more rigorous hyperparameter tweaking process, this data science pipeline could be improved. But hopefully, this tutorial has given you some insight into what goes into code that decodes.
Thank you for taking the time to go through this, and below are some links that you can use to explore the tools and ideas we used even further.
Since this notebook was created in Google Colab, the code cell below is simply to help me download it as a .html file to submit as my CMSC320 final project for Spring 2021 at UMD.
!jupyter nbconvert --to html /content/CMSC320_Final_BankChurners.ipynb